BIL476 - BANKING DATASET ANALYSIS¶

Upload The Banking Dataset¶

In Banking Dataset Analysis Project, the dataset used is "Bank Marketing" dataset from UCI Machine Learning Repository.

In [ ]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata) 

# variable information  
display(bank_marketing.variables) 
{'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'title': 'A data-driven approach to predict the success of bank telemarketing', 'authors': 'Sérgio Moro, P. Cortez, P. Rita', 'published_in': 'Decision Support Systems', 'year': 2014, 'url': 'https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e', 'doi': '10.1016/j.dss.2014.03.001'}, 'additional_info': {'summary': "The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. \n\nThere are four datasets: \n1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]\n2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.\n3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). \n4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). \nThe smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). \n\nThe classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Input variables:\n   # bank client data:\n   1 - age (numeric)\n   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",\n                                       "blue-collar","self-employed","retired","technician","services") \n   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)\n   4 - education (categorical: "unknown","secondary","primary","tertiary")\n   5 - default: has credit in default? (binary: "yes","no")\n   6 - balance: average yearly balance, in euros (numeric) \n   7 - housing: has housing loan? (binary: "yes","no")\n   8 - loan: has personal loan? (binary: "yes","no")\n   # related with the last contact of the current campaign:\n   9 - contact: contact communication type (categorical: "unknown","telephone","cellular") \n  10 - day: last contact day of the month (numeric)\n  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")\n  12 - duration: last contact duration, in seconds (numeric)\n   # other attributes:\n  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)\n  15 - previous: number of contacts performed before this campaign and for this client (numeric)\n  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")\n\n  Output variable (desired target):\n  17 - y - has the client subscribed a term deposit? (binary: "yes","no")\n', 'citation': None}}
name role type demographic description units missing_values
0 age Feature Integer Age None None no
1 job Feature Categorical Occupation type of job (categorical: 'admin.','blue-colla... None no
2 marital Feature Categorical Marital Status marital status (categorical: 'divorced','marri... None no
3 education Feature Categorical Education Level (categorical: 'basic.4y','basic.6y','basic.9y'... None no
4 default Feature Binary None has credit in default? None no
5 balance Feature Integer None average yearly balance euros no
6 housing Feature Binary None has housing loan? None no
7 loan Feature Binary None has personal loan? None no
8 contact Feature Categorical None contact communication type (categorical: 'cell... None yes
9 day_of_week Feature Date None last contact day of the week None no
10 month Feature Date None last contact month of year (categorical: 'jan'... None no
11 duration Feature Integer None last contact duration, in seconds (numeric). ... None no
12 campaign Feature Integer None number of contacts performed during this campa... None no
13 pdays Feature Integer None number of days that passed by after the client... None yes
14 previous Feature Integer None number of contacts performed before this campa... None no
15 poutcome Feature Categorical None outcome of the previous marketing campaign (ca... None yes
16 y Target Binary None has the client subscribed a term deposit? None no

Attributes in Data

In [ ]:
display(X)
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN
3 47 blue-collar married NaN no 1506 yes no NaN 5 may 92 1 -1 0 NaN
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45206 51 technician married tertiary no 825 no no cellular 17 nov 977 3 -1 0 NaN
45207 71 retired divorced primary no 1729 no no cellular 17 nov 456 2 -1 0 NaN
45208 72 retired married secondary no 5715 no no cellular 17 nov 1127 5 184 3 success
45209 57 blue-collar married secondary no 668 no no telephone 17 nov 508 4 -1 0 NaN
45210 37 entrepreneur married secondary no 2971 no no cellular 17 nov 361 2 188 11 other

45211 rows × 16 columns

The Target Value in Data

In [ ]:
display(y)
y
0 no
1 no
2 no
3 no
4 no
... ...
45206 yes
45207 yes
45208 yes
45209 no
45210 no

45211 rows × 1 columns

Concat Attribute and the Target Value for Common Processing Steps

In [ ]:
import pandas as pd
df = pd.concat([X, y], axis=1)
display(df)
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN no
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN no
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN no
3 47 blue-collar married NaN no 1506 yes no NaN 5 may 92 1 -1 0 NaN no
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45206 51 technician married tertiary no 825 no no cellular 17 nov 977 3 -1 0 NaN yes
45207 71 retired divorced primary no 1729 no no cellular 17 nov 456 2 -1 0 NaN yes
45208 72 retired married secondary no 5715 no no cellular 17 nov 1127 5 184 3 success yes
45209 57 blue-collar married secondary no 668 no no telephone 17 nov 508 4 -1 0 NaN no
45210 37 entrepreneur married secondary no 2971 no no cellular 17 nov 361 2 188 11 other no

45211 rows × 17 columns

EDA¶

Check Missing Values

In [ ]:
null_values =df.isnull().sum()
null_values_df = pd.DataFrame(null_values, columns=['Total Null Values'])
print("TOTAL NULL VALUES PER ATTRIBUTE:")
display(null_values_df)
TOTAL NULL VALUES PER ATTRIBUTE:
Total Null Values
age 0
job 288
marital 0
education 1857
default 0
balance 0
housing 0
loan 0
contact 13020
day_of_week 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 36959
y 0

Descriptive Statistics of Data

In [ ]:
import pandas as pd

descriptive_stats = pd.DataFrame(df.describe())
descriptive_stats = descriptive_stats.style.set_caption("Descriptive Statistics")
print("Descriptive Statistics")
display( descriptive_stats)
Descriptive Statistics
Descriptive Statistics
  age balance day_of_week duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 40.936210 1362.272058 15.806419 258.163080 2.763841 40.197828 0.580323
std 10.618762 3044.765829 8.322476 257.527812 3.098021 100.128746 2.303441
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1428.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

Mode Values Per Attribute and Target Value

In [ ]:
modes = df.mode()
modes = modes.style.set_caption("Mode Values")
print("Modes")
display(modes)
Modes
Mode Values
  age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome y
0 32 blue-collar married secondary no 0 yes no cellular 20 may 124 1 -1 0 failure no

HISTOGRAMS & PAIR PLOTS¶

Histogram Distributions Of Attributes

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_histograms(data, plot_params, cols=4):
    # Calculate the number of rows needed
    num_plots = len(plot_params)
    rows = (num_plots // cols) + (num_plots % cols > 0)
    
    # Create subplots
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 5, rows * 3))
    
    # Flatten axes array for easy iteration
    axes = axes.flatten()

    for i, (column, params) in enumerate(plot_params.items()):
        rotation = params[2] if len(params) == 3 else 0
        sns.histplot(data=data, x=column, bins=20, kde=True, ax=axes[i])
        axes[i].set_title(params[0])
        axes[i].set_xlabel(params[1])
        axes[i].set_ylabel('Count')
        if rotation:
            axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation=rotation, ha='right')

    # Remove any unused subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

# Dictionary to store plot parameters for each column
plot_params = {
    'age': ('Distribution of Customer Ages', 'Age'),
    'job': ('Distribution of Customer Jobs', 'Job', 45),
    'marital': ('Distribution of Customer Marital Status', 'Marital'),
    'education': ('Distribution of Customer Education Levels', 'Education Level'),
    'default': ('Presence of Outstanding Debt Distribution', 'Outstanding Debt'),
    'balance': ('Distribution of Average Annual Salary', 'Annual Salary'),
    'housing': ('Presence of Housing Loan Distribution', 'Housing Loan'),
    'loan': ('Presence of Personal Loan Distribution', 'Personal Loan'),
    'contact': ('Distribution of Communication Channels', 'Communication Channels'),
    'day_of_week': ('Distribution of the Last Contact Days of Weeks', 'Day'),
    'month': ('Distribution of the Last Contact Months of Years', 'Month'),
    'duration': ('Distribution of Communication Time in Seconds', 'Time in Seconds'),
    'campaign': ('Distribution of the Numbers of Contacts', 'Contacts'),
    'pdays': ('Distribution of Numbers of Days Since the Last Contacts', 'Days'),
    'previous': ('Distribution of Numbers of Contacts Before the Campaign', 'Contacts'),
    'poutcome': ('Distribution of the Results of the Previous Campaign', 'Results')
}

# Plot histograms in a grid layout
plot_histograms(X, plot_params)
C:\Users\elifr\AppData\Local\Temp\ipykernel_1684\3588507636.py:22: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation=rotation, ha='right')
No description has been provided for this image

Histogram Distribution of the Target Value

In [ ]:
plt.figure(figsize=(5, 3))
sns.histplot(data=y, x='y', bins=20, kde=True)
plt.title('Distribution of Target Value')
plt.xlabel('Target Value')
plt.ylabel('Count')
plt.show()
No description has been provided for this image

Pair Plots

In [ ]:
print("PAIR PLOTS OF ATTRIBUTES\n")
sns.pairplot(df, diag_kind='kde', hue='y')
plt.suptitle('PAIR PLOTS OF ATTRIBUTES', y=1.02)
plt.show()
PAIR PLOTS OF ATTRIBUTES

No description has been provided for this image

BOXPLOT GENERATION¶

As it can be understood from the definition of some attributes and their distributions, it is not that logical to draw the boxplot. Even though this is the case, boxplot generation for each attribute has been performed in order not to miss anything.

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

# Get the list of all columns in the X 
columns = X.columns

# Number of columns for the subplot grid
cols = 4  # Adjust this number based on how many columns you want per row
rows = (len(columns) // cols) + (len(columns) % cols > 0)

# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(cols * 5, rows * 4))

# Flatten axes array for easy iteration
axes = axes.flatten()

for i, column in enumerate(columns):
    sns.boxplot(x=X[column], ax=axes[i])
    axes[i].set_title(f'Boxplot of {column.capitalize()}')
    axes[i].set_xlabel(column.capitalize())

# Remove any unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
No description has been provided for this image

Correlation Heatmap¶

In [ ]:
df_encoded = pd.get_dummies(df, drop_first=True)

# Check for and handle missing values
df_encoded = df_encoded.dropna()

# Calculate the correlation matrix
corr_matrix = df_encoded.corr()

# Generate the heatmap
plt.figure(figsize=(20, 15))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
No description has been provided for this image

Check & Handle Outliers

In [ ]:
import numpy as np

u, ind = np.unique(y, return_inverse=True)
plt.scatter(ind, df['balance'])

plt.show()
No description has been provided for this image
In [ ]:
plt.scatter( df['duration'], df['balance'])
Out[ ]:
<matplotlib.collections.PathCollection at 0x28002d36410>
No description has been provided for this image
In [ ]:
plt.scatter(df['age'], df['balance'])
Out[ ]:
<matplotlib.collections.PathCollection at 0x2807f2955d0>
No description has been provided for this image
In [ ]:
X = X[X['balance'] <= 55000]
plt.figure(figsize=(4, 4))
sns.boxplot(df['balance'])
plt.show()
No description has been provided for this image

Education - Job¶

In [ ]:
sns.boxplot(data=df, x="education", y="job")
Out[ ]:
<Axes: xlabel='education', ylabel='job'>
No description has been provided for this image

NaN Değer Kontrolü

In [ ]:
import pandas as pd

nan_counts = {column: df[column].isna().sum() for column in df.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])

display(nan_counts_df)
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome y
NaN Counts 0 288 0 1857 0 0 0 0 13020 0 0 0 0 0 0 36959 0

Fill the Null Values in Education

In [ ]:
X.loc[(X['education'].isnull()) & (X['job'] == 'admin.'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'blue-collar'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'entrepreneur'), 'education'] = 'tertiary'
X.loc[(X['education'].isnull()) & (X['job'] == 'housemaid'), 'education'] = 'primary'
X.loc[(X['education'].isnull()) & (X['job'] == 'management'), 'education'] = 'tertiary'
X.loc[(X['education'].isnull()) & (X['job'] == 'retired'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'self-employed'), 'education'] = 'tertiary'
X.loc[(X['education'].isnull()) & (X['job'] == 'services'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'student'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'technician'), 'education'] = 'secondary'
X.loc[(X['education'].isnull()) & (X['job'] == 'unemployed'), 'education'] = 'secondary'

Null Value Control After Filling in the 'education' Column According to the 'education'-'job' Relationship

In [ ]:
job_counts = df['job'].value_counts(dropna=False)
plt.figure(figsize = (6, 6))
plt.pie(job_counts, labels=job_counts.index, autopct='%1.1f%%')

education_counts = df['education'].value_counts(dropna=False)
plt.figure(figsize = (6, 6))
plt.pie(education_counts, labels=education_counts.index, autopct='%1.1f%%')

poutcome_counts = df['poutcome'].value_counts(dropna=False)
plt.figure(figsize = (6, 6))
plt.pie(poutcome_counts, labels=poutcome_counts.index, autopct='%1.1f%%')
plt.show()

pdays_counts = df['pdays'].value_counts(dropna=False)
plt.figure(figsize=(6, 6))
plt.pie(pdays_counts, labels=pdays_counts.index, autopct='%1.1f%%')
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [ ]:
nan_counts = {column: X[column].isna().sum() for column in X.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])

display(nan_counts_df)
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome
NaN Counts 0 288 0 127 0 0 0 0 13018 0 0 0 0 0 0 36948
In [ ]:
df = df.drop(columns=['contact', 'poutcome'])
In [ ]:
nan_counts = {column: df[column].isna().sum() for column in df.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])

display(nan_counts_df)
age job marital education default balance housing loan day_of_week month duration campaign pdays previous y
NaN Counts 0 288 0 1857 0 0 0 0 0 0 0 0 0 0 0
In [ ]:
df = df.dropna()
In [ ]:
nan_counts = {column: df[column].isna().sum() for column in df.columns}
nan_counts_df = pd.DataFrame([nan_counts], index=['NaN Counts'])

display(nan_counts_df)
age job marital education default balance housing loan day_of_week month duration campaign pdays previous y
NaN Counts 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [ ]:
display(df)
age job marital education default balance housing loan day_of_week month duration campaign pdays previous y
0 58 management married tertiary no 2143 yes no 5 may 261 1 -1 0 no
1 44 technician single secondary no 29 yes no 5 may 151 1 -1 0 no
2 33 entrepreneur married secondary no 2 yes yes 5 may 76 1 -1 0 no
5 35 management married tertiary no 231 yes no 5 may 139 1 -1 0 no
6 28 management single tertiary no 447 yes yes 5 may 217 1 -1 0 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45206 51 technician married tertiary no 825 no no 17 nov 977 3 -1 0 yes
45207 71 retired divorced primary no 1729 no no 17 nov 456 2 -1 0 yes
45208 72 retired married secondary no 5715 no no 17 nov 1127 5 184 3 yes
45209 57 blue-collar married secondary no 668 no no 17 nov 508 4 -1 0 no
45210 37 entrepreneur married secondary no 2971 no no 17 nov 361 2 188 11 no

43193 rows × 15 columns

In [ ]:
min_value = df['day_of_week'].min()
max_value = df['day_of_week'].max()

min_max_df = pd.DataFrame({
    'Statistic': ['Minimum deÄŸer', 'Maksimum deÄŸer'],
    'day_of_week Value': [min_value, max_value]
})

display(min_max_df)
Statistic day_of_week Value
0 Minimum deÄŸer 1
1 Maksimum deÄŸer 31
In [ ]:
job_value_counts = df['job'].value_counts()
job_counts_df = pd.DataFrame(list(job_value_counts.items()), columns=['Job', 'Counts'])
display(job_counts_df.T)
0 1 2 3 4 5 6 7 8 9 10
Job blue-collar management technician admin. services retired self-employed entrepreneur unemployed housemaid student
Counts 9278 9216 7355 5000 4004 2145 1540 1411 1274 1195 775
In [ ]:
categorical_columns = df.select_dtypes(include=['object']).columns
print("Kategorik verili sütunlar:")
categorical = pd.DataFrame(categorical_columns, columns=['Categorical Columns'])
display(categorical.T)
Kategorik verili sütunlar:
0 1 2 3 4 5 6 7
Categorical Columns job marital education default housing loan month y
In [ ]:
mon_value = df['month'].unique()
mon_value_df = pd.DataFrame(list(mon_value), columns=['Month'])
display(mon_value_df.T)
0 1 2 3 4 5 6 7 8 9 10 11
Month may jun jul aug oct nov dec jan feb mar apr sep

ENCODING¶

In [ ]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

encoder = LabelEncoder()
for column in ['job', 'marital', 'education', 'default', 'housing', 'loan']:
    df[column] = encoder.fit_transform(df[column])

display(X.head())
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN
3 47 blue-collar married secondary no 1506 yes no NaN 5 may 92 1 -1 0 NaN
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN
In [ ]:
month_mapping = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
df['month'] = df['month'].map(month_mapping)
In [ ]:
df = df.sample(frac=1).reset_index(drop=True)
display(df.head())
age job marital education default balance housing loan day_of_week month duration campaign pdays previous y
0 26 4 1 2 0 877 1 0 22 7 611 1 -1 0 no
1 46 4 2 2 0 -867 1 1 21 5 222 5 -1 0 no
2 30 4 2 2 0 19796 0 0 20 11 41 1 -1 0 no
3 33 1 2 1 0 56 1 0 29 1 290 4 -1 0 no
4 43 9 2 1 0 336 1 0 9 5 181 1 -1 0 no
In [ ]:
df['y'] = df['y'].map({'no': 0, 'yes': 1})
display(df.head())
age job marital education default balance housing loan day_of_week month duration campaign pdays previous y
0 26 4 1 2 0 877 1 0 22 7 611 1 -1 0 0
1 46 4 2 2 0 -867 1 1 21 5 222 5 -1 0 0
2 30 4 2 2 0 19796 0 0 20 11 41 1 -1 0 0
3 33 1 2 1 0 56 1 0 29 1 290 4 -1 0 0
4 43 9 2 1 0 336 1 0 9 5 181 1 -1 0 0

NORMALIZATION - ROBUST SCALING¶

In [ ]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df = scaler.fit_transform(df)
In [ ]:
from sklearn.model_selection import train_test_split

X_features = df[:, :-1]
y_target = df[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.2, random_state=2)

X_df = pd.DataFrame(X_features)
In [ ]:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2,
            init='k-means++', 
            n_init=10,
            max_iter=100, 
            random_state=42)

clusters_predict = km.fit_predict(X_df)
In [ ]:
# Calculation the principal components in 2D and 3D

import prince
import plotly.express as px

def get_pca_2d(df, predict):

    pca_2d_object = prince.PCA(
    n_components=2,
    n_iter=3,
    rescale_with_mean=True,
    rescale_with_std=True,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
    )

    pca_2d_object.fit(df)

    df_pca_2d = pca_2d_object.transform(df)
    df_pca_2d.columns = ["comp1", "comp2"]
    df_pca_2d["cluster"] = predict

    return pca_2d_object, df_pca_2d



def get_pca_3d(df, predict):

    pca_3d_object = prince.PCA(
    n_components=3,
    n_iter=3,
    rescale_with_mean=True,
    rescale_with_std=True,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
    )

    pca_3d_object.fit(df)

    df_pca_3d = pca_3d_object.transform(df)
    df_pca_3d.columns = ["comp1", "comp2", "comp3"]
    df_pca_3d["cluster"] = predict

    return pca_3d_object, df_pca_3d



def plot_pca_3d(df, title = "PCA Space", opacity=0.8, width_line = 0.1):

    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")

    columns = df.columns[0:3].tolist()

    fig = px.scatter_3d(df, 
                        x=columns[0], 
                        y=columns[1], 
                        z=columns[2],
                        color='cluster',
                        template="plotly",
                        
                        # symbol = "cluster",
                        
                        color_discrete_sequence=px.colors.qualitative.Vivid,
                        title=title).update_traces(
                            # mode = 'markers',
                            marker={
                                "size": 4,
                                "opacity": opacity,
                                # "symbol" : "diamond",
                                "line": {
                                    "width": width_line,
                                    "color": "black",
                                }
                            }
                        ).update_layout(
                                width = 1000, 
                                height = 800, 
                                autosize = False, 
                                showlegend = True,
                                legend=dict(title_font_family="Times New Roman",
                                            font=dict(size= 20)),
                                scene = dict(xaxis=dict(title = 'comp1', titlefont_color = 'black'),
                                            yaxis=dict(title = 'comp2', titlefont_color = 'black'),
                                            zaxis=dict(title = 'comp3', titlefont_color = 'black')),
                                font = dict(family = "Gilroy", color  = 'black', size = 15))
                      
    
    fig.show()

pca_3d_object, df_pca_3d = get_pca_3d(X_df, clusters_predict)
plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=1, width_line = 0.1)
print("The variability is :", pca_3d_object.eigenvalues_summary)
The variability is :           eigenvalue % of variance % of variance (cumulative)
component                                                    
0              1.653        11.81%                     11.81%
1              1.510        10.78%                     22.59%
2              1.355         9.68%                     32.28%
In [ ]:
def plot_pca_2d(df, title = "PCA Space", opacity=0.8, width_line = 0.1):

    df = df.astype({"cluster": "object"})
    df = df.sort_values("cluster")

    columns = df.columns[0:3].tolist()


    fig = px.scatter(df, 
                        x=columns[0], 
                        y=columns[1],
                        color='cluster',
                        template="plotly",
                        # symbol = "cluster",
                        
                        color_discrete_sequence=px.colors.qualitative.Vivid,
                        title=title).update_traces(
                            # mode = 'markers',
                            marker={
                                "size": 8,
                                "opacity": opacity,
                                # "symbol" : "diamond",
                                "line": {
                                    "width": width_line,
                                    "color": "black",
                                }
                            }
                        ).update_layout(
                                width = 800, 
                                height = 700, 
                                autosize = False, 
                                showlegend = True,
                                legend=dict(title_font_family="Times New Roman",
                                            font=dict(size= 20)),
                                scene = dict(xaxis=dict(title = 'comp1', titlefont_color = 'black'),
                                            yaxis=dict(title = 'comp2', titlefont_color = 'black'),
                                            ),
                                font = dict(family = "Gilroy", color  = 'black', size = 15))
                        
        
    fig.show()
    
pca_2d_object, df_pca_2d = get_pca_2d(X_df, clusters_predict)
plot_pca_2d(df_pca_2d, title = "PCA Space", opacity=1, width_line = 0.5)

MODELLING¶

Machine Learning Models¶

In [ ]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
In [ ]:
 
In [ ]:
# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)  # No need to scale features for Random Forest

lr = LogisticRegression(max_iter=500 , random_state = 2)
lr.fit(X_train,y_train)
Out[ ]:
LogisticRegression(max_iter=500, random_state=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(max_iter=500, random_state=2)
In [ ]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

rf_predictions = rf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_recall = recall_score(y_test, rf_predictions, average='macro')
rf_precision = precision_score(y_test, rf_predictions, average='macro')
rf_f1 = f1_score(y_test, rf_predictions, average='macro')

lr_predictions = lr.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)
lr_recall = recall_score(y_test, lr_predictions, average='macro')
lr_precision = precision_score(y_test, lr_predictions, average='macro')
lr_f1 = f1_score(y_test, lr_predictions, average='macro')

Confusion Matrices & Classification Report

In [ ]:
from sklearn.metrics import classification_report, confusion_matrix

rf_conf_matrix = confusion_matrix(y_test, rf_predictions)
rf_class_report = classification_report(y_test, rf_predictions, output_dict=True)
rf_report_df = pd.DataFrame(rf_class_report).transpose()

print("Decision Tree Confusion Matrix:")
display(pd.DataFrame(rf_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Decision Tree Classification Report:")
display(rf_report_df)

lr_conf_matrix = confusion_matrix(y_test, lr_predictions)
lr_class_report = classification_report(y_test, lr_predictions, output_dict=True)
lr_report_df = pd.DataFrame(lr_class_report).transpose()

print("Logistic Regression Confusion Matrix:")
display(pd.DataFrame(lr_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Logistic Regression Classification Report:")
display(lr_report_df)
Decision Tree Confusion Matrix:
0.0 1.0
0.0 7469 227
1.0 573 370
Decision Tree Classification Report:
precision recall f1-score support
0.0 0.928749 0.970504 0.949168 7696.000000
1.0 0.619765 0.392365 0.480519 943.000000
accuracy 0.907397 0.907397 0.907397 0.907397
macro avg 0.774257 0.681434 0.714844 8639.000000
weighted avg 0.895022 0.907397 0.898012 8639.000000
Logistic Regression Confusion Matrix:
0.0 1.0
0.0 7548 148
1.0 750 193
Logistic Regression Classification Report:
precision recall f1-score support
0.0 0.909617 0.980769 0.943854 7696.000000
1.0 0.565982 0.204666 0.300623 943.000000
accuracy 0.896053 0.896053 0.896053 0.896053
macro avg 0.737800 0.592718 0.622238 8639.000000
weighted avg 0.872107 0.896053 0.873641 8639.000000
In [ ]:
results = pd.DataFrame({
    'Metric': ['Accuracy', 'Recall', 'Precision', 'F1 Score'],
    'Random Forest': [rf_accuracy, rf_recall, rf_precision, rf_f1],
    'Logistic Regression': [lr_accuracy, lr_recall, lr_precision, lr_f1]
})

# Display the table
display(results)
Metric Random Forest Logistic Regression
0 Accuracy 0.907397 0.896053
1 Recall 0.681434 0.592718
2 Precision 0.774257 0.737800
3 F1 Score 0.714844 0.622238

ROC - AUC Curves

In [ ]:
from sklearn.metrics import roc_curve, roc_auc_score

rf_auc = roc_auc_score(y_test, rf_predictions)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, rf_predictions)
print('ROC_AUC_SCORE for Random Forest is', rf_auc)
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve for Random Forest')
plt.show()

lr_auc = roc_auc_score(y_test, lr_predictions)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, lr_predictions)
print('ROC_AUC_SCORE for Logistic Regressiın is', lr_auc)
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC Curve for Logistic Regression')
plt.show()
ROC_AUC_SCORE for Random Forest is 0.6814344756086538
No description has been provided for this image
ROC_AUC_SCORE for Logistic Regressiın is 0.592717595236153
No description has been provided for this image

Grid Search & Cross-Validation for Hyperparameter Tuning

In [ ]:
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 300, num=3)]  
max_features = ['sqrt'] # # of features to consider at every split
max_depth = [int(x) for x in np.linspace(5, 10, num=2)] # max # of levels in tree
max_depth.append(None)
min_samples_split = [2, 5] # min # of samples required to split a node
min_samples_leaf = [1, 2] # min # of samples required at each leaf node
criterion = ['gini', 'entropy']

rf_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'criterion': criterion}

print("Grid Parameters for Random Forest")
display((rf_grid))

penalty = ["l2"]    # norm of the penalty
C = [0.001, 0.01, 0.1, 1, 10]   # inverse of regularization strength

lr_grid = {"penalty": penalty,
            "C": C}

print("Grid Parameters for Logistic Regression")
display((lr_grid))
Grid Parameters for Random Forest
{'n_estimators': [100, 200, 300],
 'max_features': ['sqrt'],
 'max_depth': [5, 10, None],
 'min_samples_split': [2, 5],
 'min_samples_leaf': [1, 2],
 'criterion': ['gini', 'entropy']}
Grid Parameters for Logistic Regression
{'penalty': ['l2'], 'C': [0.001, 0.01, 0.1, 1, 10]}
In [ ]:
from sklearn.model_selection import GridSearchCV

rf_grid_search = GridSearchCV(RandomForestClassifier(), rf_grid, cv=5, scoring='accuracy', n_jobs=-1)
rf_grid_search.fit(X_train, y_train)
best_rf = rf_grid_search.best_estimator_
print("Best Random Forest Parameters:", rf_grid_search.best_params_)

lr_grid_search = GridSearchCV(LogisticRegression(max_iter=500), lr_grid, cv=5, scoring='accuracy', n_jobs=-1)
lr_grid_search.fit(X_train, y_train)
best_lr = lr_grid_search.best_estimator_
print("Best SVC Parameters:", lr_grid_search.best_params_)
Best Random Forest Parameters: {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best SVC Parameters: {'C': 0.01, 'penalty': 'l2'}
In [ ]:
rf_params = rf_grid_search.best_params_
best_rf = RandomForestClassifier(**rf_params)
best_rf.fit(X_train, y_train)

lr_params = lr_grid_search.best_params_
best_lr = LogisticRegression(**lr_params, max_iter=500)
best_lr.fit(X_train, y_train)
Out[ ]:
LogisticRegression(C=0.01, max_iter=500)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LogisticRegression(C=0.01, max_iter=500)
In [ ]:
rf_predictions_grid = best_rf.predict(X_test)
rf_conf_matrix_grid = confusion_matrix(y_test, rf_predictions_grid)
rf_class_report_grid = classification_report(y_test, rf_predictions_grid, output_dict=True)
rf_report_df_grid = pd.DataFrame(rf_class_report_grid).transpose()

rf_accuracy_grid = accuracy_score(y_test, rf_predictions_grid)
rf_recall_grid = recall_score(y_test, rf_predictions_grid, average='macro')
rf_precision_grid = precision_score(y_test, rf_predictions_grid, average='macro')
rf_f1_grid = f1_score(y_test, rf_predictions_grid, average='macro')

print("Random Forest Accuracy Using Grid Search Parameters:", rf_accuracy_grid) 
print("Random Forest Confusion Matrix Using Grid Search Parameters:")
display(pd.DataFrame(rf_conf_matrix_grid, columns=np.unique(y_test), index=np.unique(y_test)))
print("Random Forest Classification Report Using Grid Search Parameters:")
display(rf_report_df_grid) 

lr_predictions_grid = best_lr.predict(X_test)

lr_conf_matrix_grid = confusion_matrix(y_test, lr_predictions_grid)
lr_class_report_grid = classification_report(y_test, lr_predictions_grid, output_dict=True)
lr_report_df_grid = pd.DataFrame(lr_class_report_grid).transpose()

lr_accuracy_grid = accuracy_score(y_test, lr_predictions_grid)
lr_recall_grid = recall_score(y_test, lr_predictions_grid, average='macro')
lr_precision_grid = precision_score(y_test, lr_predictions_grid, average='macro')
lr_f1_grid = f1_score(y_test, lr_predictions_grid, average='macro')

print("Logistic Regression Accuracy Using Grid Search Parameters:", lr_accuracy_grid) 
print("Logistic Regression Confusion Matrix Using Grid Search Parameters:")
display(pd.DataFrame(lr_conf_matrix_grid, columns=np.unique(y_test), index=np.unique(y_test)))
print("Logistic Regression Classification Report Using Grid Search Parameters:")
display(lr_report_df_grid)
Random Forest Accuracy Using Grid Search Parameters: 0.9071651811552263
Random Forest Confusion Matrix Using Grid Search Parameters:
0.0 1.0
0.0 7477 219
1.0 583 360
Random Forest Classification Report Using Grid Search Parameters:
precision recall f1-score support
0.0 0.927667 0.971544 0.949099 7696.000000
1.0 0.621762 0.381760 0.473062 943.000000
accuracy 0.907165 0.907165 0.907165 0.907165
macro avg 0.774715 0.676652 0.711080 8639.000000
weighted avg 0.894276 0.907165 0.897136 8639.000000
Logistic Regression Accuracy Using Grid Search Parameters: 0.8958212756106031
Logistic Regression Confusion Matrix Using Grid Search Parameters:
0.0 1.0
0.0 7561 135
1.0 765 178
Logistic Regression Classification Report Using Grid Search Parameters:
precision recall f1-score support
0.0 0.908119 0.982458 0.943827 7696.000000
1.0 0.568690 0.188759 0.283439 943.000000
accuracy 0.895821 0.895821 0.895821 0.895821
macro avg 0.738405 0.585609 0.613633 8639.000000
weighted avg 0.871068 0.895821 0.871742 8639.000000
In [ ]:
results = pd.DataFrame({
    'Metric': ['Accuracy', 'Recall', 'Precision', 'F1 Score'],
    'Random Forest': [rf_accuracy, rf_recall, rf_precision, rf_f1],
    'Logistic Regression': [lr_accuracy, lr_recall, lr_precision, lr_f1],
    'Random Forest Using Grid Search Parameters':[rf_accuracy_grid, rf_recall_grid, rf_precision_grid, rf_f1_grid],
    'Logistic Regression Grid Search Parameters': [lr_accuracy_grid, lr_recall_grid, lr_precision_grid, lr_f1_grid]
})

# Display the table
display(results)
Metric Random Forest Logistic Regression Random Forest Using Grid Search Parameters Logistic Regression Grid Search Parameters
0 Accuracy 0.907397 0.896053 0.907165 0.895821
1 Recall 0.681434 0.592718 0.676652 0.585609
2 Precision 0.774257 0.737800 0.774715 0.738405
3 F1 Score 0.714844 0.622238 0.711080 0.613633

Improve Model Performance Using Ensemble Methods

In [ ]:
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier

# Boosting with AdaBooST using the Best Random Forest
boosting = AdaBoostClassifier(best_rf, n_estimators=50, random_state=2, algorithm="SAMME")
boosting.fit(X_train, y_train)
boosting_predictions = boosting.predict(X_test)
print("Boosting with AdaBoost Accuracy:", accuracy_score(y_test, boosting_predictions))

# Stacking with Decision Tree, Naive Bayes, and SVM
# using decision tree classifier as the final estimator
models = [('rf', best_lr), ('lr', best_lr)]
stacking1 = StackingClassifier(estimators=models, final_estimator=RandomForestClassifier())
stacking1.fit(X_train, y_train)
stacking1_predictions = stacking1.predict(X_test)
print("Stacking Accuracy with Decision Tree as the Final Estimator:", accuracy_score(y_test, stacking1_predictions))

# using naive bayes as the final estimator
stacking2 = StackingClassifier(estimators=models, final_estimator=LogisticRegression())
stacking2.fit(X_train, y_train)
stacking2_predictions = stacking2.predict(X_test)
print("Stacking Accuracy withh Naive Bayes as the Final Estimator:", accuracy_score(y_test, stacking2_predictions))
Boosting with AdaBoost Accuracy: 0.9038083111471235
Stacking Accuracy with Decision Tree as the Final Estimator: 0.8437319134159046
Stacking Accuracy withh Naive Bayes as the Final Estimator: 0.8958212756106031
In [ ]:
boosting_accuracy = accuracy_score(y_test, boosting_predictions)
boosting_recall = recall_score(y_test, boosting_predictions, average='macro')
boosting_precision = precision_score(y_test, boosting_predictions, average='macro')
boosting_f1 = f1_score(y_test, boosting_predictions, average='macro')
boosting_conf_matrix = confusion_matrix(y_test, rf_predictions_grid)
boosting_class_report = classification_report(y_test, rf_predictions_grid, output_dict=True)
boosting_report_df = pd.DataFrame(rf_class_report_grid).transpose()

print("AdaBoost Using Random Forest Estimator Accuracy:", boosting_accuracy) 
print("AdaBoost Using Random Forest Estimator Confusion Matrix:")
display(pd.DataFrame(boosting_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("AdaBoost Using Random Forest Estimator Classification Report:")
display(boosting_report_df) 

stacking1_accuracy = accuracy_score(y_test, stacking1_predictions)
stacking1_recall = recall_score(y_test, stacking1_predictions, average='macro')
stacking1_precision = precision_score(y_test, stacking1_predictions, average='macro')
stacking1_f1 = f1_score(y_test, stacking1_predictions, average='macro')
stacking1_conf_matrix = confusion_matrix(y_test, stacking1_predictions)
stacking1_class_report = classification_report(y_test, stacking1_predictions, output_dict=True)
stacking1_report_df = pd.DataFrame(stacking1_class_report).transpose()

print("Stacking Using Random Forest as the Final Estimator Accuracy:", stacking1_accuracy)
print("Stacking Using Random Forest as the Final Estimator Confusion Matrix:")
display(pd.DataFrame(stacking1_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Stacking Using Random Forest as the Final Estimator Classification Report:")
display(stacking1_report_df)

stacking2_accuracy = accuracy_score(y_test, stacking2_predictions)
stacking2_recall = recall_score(y_test, stacking2_predictions, average='macro')
stacking2_precision = precision_score(y_test, stacking2_predictions, average='macro')
stacking2_f1 = f1_score(y_test, stacking2_predictions, average='macro')
stacking2_conf_matrix = confusion_matrix(y_test, stacking2_predictions)
stacking2_class_report = classification_report(y_test, stacking2_predictions, output_dict=True)
stacking2_report_df = pd.DataFrame(stacking2_class_report).transpose()

print("Stacking Using Logistic Regression as the Final Estimator Accuracy:", stacking2_accuracy)
print("Stacking Using Logistic Regression as the Final Estimator Confusion Matrix:")
display(pd.DataFrame(stacking2_conf_matrix, columns=np.unique(y_test), index=np.unique(y_test)))
print("Stacking Using Logistic Regression as the Final Estimator Classification Report:")
display(stacking2_report_df)
AdaBoost Using Random Forest Estimator Accuracy: 0.9038083111471235
AdaBoost Using Random Forest Estimator Confusion Matrix:
0.0 1.0
0.0 7477 219
1.0 583 360
AdaBoost Using Random Forest Estimator Classification Report:
precision recall f1-score support
0.0 0.927667 0.971544 0.949099 7696.000000
1.0 0.621762 0.381760 0.473062 943.000000
accuracy 0.907165 0.907165 0.907165 0.907165
macro avg 0.774715 0.676652 0.711080 8639.000000
weighted avg 0.894276 0.907165 0.897136 8639.000000
Stacking Using Random Forest as the Final Estimator Accuracy: 0.8437319134159046
Stacking Using Random Forest as the Final Estimator Confusion Matrix:
0.0 1.0
0.0 6979 717
1.0 633 310
Stacking Using Random Forest as the Final Estimator Classification Report:
precision recall f1-score support
0.0 0.916842 0.906835 0.911811 7696.000000
1.0 0.301850 0.328738 0.314721 943.000000
accuracy 0.843732 0.843732 0.843732 0.843732
macro avg 0.609346 0.617786 0.613266 8639.000000
weighted avg 0.849712 0.843732 0.846635 8639.000000
Stacking Using Logistic Regression as the Final Estimator Accuracy: 0.8958212756106031
Stacking Using Logistic Regression as the Final Estimator Confusion Matrix:
0.0 1.0
0.0 7542 154
1.0 746 197
Stacking Using Logistic Regression as the Final Estimator Classification Report:
precision recall f1-score support
0.0 0.909990 0.979990 0.943694 7696.000000
1.0 0.561254 0.208908 0.304482 943.000000
accuracy 0.895821 0.895821 0.895821 0.895821
macro avg 0.735622 0.594449 0.624088 8639.000000
weighted avg 0.871924 0.895821 0.873920 8639.000000
In [ ]:
results = pd.DataFrame({
    'Metric': ['Accuracy', 'Recall', 'Precision', 'F1 Score'],
    'Random Forest': [rf_accuracy, rf_recall, rf_precision, rf_f1],
    'Logistic Regression': [lr_accuracy, lr_recall, lr_precision, lr_f1],
    'Random Forest Using Grid Search Parameters':[rf_accuracy_grid, rf_recall_grid, rf_precision_grid, rf_f1_grid],
    'Logistic Regression Grid Search Parameters': [lr_accuracy_grid, lr_recall_grid, lr_precision_grid, lr_f1_grid],
    'AdaBoost Using Random Forest Estimator':[boosting_accuracy, boosting_recall, boosting_precision, boosting_f1],
    'Stacking Using Random Forest as the Final Estimator':[stacking1_accuracy, stacking1_recall, stacking1_precision, stacking1_f1],
    'Stacking Using Logistic Regression as the Final Estimator':[stacking2_accuracy, stacking2_recall, stacking2_precision, stacking2_f1]
})

# Display the table
display(results)
Metric Random Forest Logistic Regression Random Forest Using Grid Search Parameters Logistic Regression Grid Search Parameters AdaBoost Using Random Forest Estimator Stacking Using Random Forest as the Final Estimator Stacking Using Logistic Regression as the Final Estimator
0 Accuracy 0.907397 0.896053 0.907165 0.895821 0.903808 0.843732 0.895821
1 Recall 0.681434 0.592718 0.676652 0.585609 0.607771 0.617786 0.594449
2 Precision 0.774257 0.737800 0.774715 0.738405 0.793805 0.609346 0.735622
3 F1 Score 0.714844 0.622238 0.711080 0.613633 0.645077 0.613266 0.624088
In [ ]:
y_test
Out[ ]:
array([1., 0., 0., ..., 0., 0., 1.])

ROC Curves

In [ ]:
#ROC Curves
# Since this is a binary classification, ...


fpr1 , tpr1, thresholds1 = roc_curve(y_test, rf.predict_proba(X_test)[:, 1])
fpr2 , tpr2, thresholds2 = roc_curve(y_test, lr.predict_proba(X_test)[:, 1])
fpr3 , tpr3, thresholds3 = roc_curve(y_test, best_rf.predict_proba(X_test)[:, 1])
fpr4 , tpr4, thresholds4 = roc_curve(y_test, best_lr.predict_proba(X_test)[:, 1])
fpr5 , tpr5, thresholds5 = roc_curve(y_test, boosting.predict_proba(X_test)[:, 1])
fpr6 , tpr6, thresholds6 = roc_curve(y_test, stacking1.predict_proba(X_test)[:, 1])
fpr7 , tpr7, thresholds7 = roc_curve(y_test, stacking2.predict_proba(X_test)[:, 1])


plt.plot([0,1],[0,1], 'k--')
plt.plot(fpr1, tpr1, label= "Random Forest")
plt.plot(fpr2, tpr2, label= "Logistic Regression")
plt.plot(fpr3, tpr3, label= "Random Forest After Grid Search")
plt.plot(fpr4, tpr4, label= "Logistic Regression After Grid Search")
plt.plot(fpr5, tpr5, label= "Boosting")
plt.plot(fpr6, tpr6, label= "Stacking1")
plt.plot(fpr7, tpr7, label= "Stacking2")
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title('Receiver Operating Characteristic')
plt.show()
No description has been provided for this image

Accuracy Comparison

In [ ]:
accuracies = [
    rf_accuracy,
    lr_accuracy,
    rf_accuracy_grid,
    lr_accuracy_grid,
    boosting_accuracy,
    stacking1_accuracy,
    stacking2_accuracy
]

# for labeling
model_names = [
    'Random Forest',
    'Logistic Regression',
    'Random Forest After Grid Search',
    'Logistic Regression After Grid Search',
    'Boosting',
    'Stacking1',
    'Stacking2'
]

# Create an accuracy histogram
plt.figure(figsize=(10, 6))
plt.bar(model_names, accuracies, color='skyblue')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Accuracy Histogram of Models')
plt.xticks(rotation=45)
plt.ylim([min(accuracies) - 0.05, max(accuracies) + 0.05]) 
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

accuracies_df = {"Model Name":model_names, 
                 "Accuracy Value":accuracies}
display(pd.DataFrame(accuracies_df))
No description has been provided for this image
Model Name Accuracy Value
0 Random Forest 0.907397
1 Logistic Regression 0.896053
2 Random Forest After Grid Search 0.907165
3 Logistic Regression After Grid Search 0.895821
4 Boosting 0.903808
5 Stacking1 0.843732
6 Stacking2 0.895821